Summary

  • Context {the real one, for when saint_analytics is up and running.}
    • saint_sports is a segment from saint_analytics that shares data analysis on many aspects of sports, with an emphasis on football (the only “football” there is ;))
  • Context
    • This project belongs to Santiago Florimontes’ portfolio, and is part of saint_analytics, a business idea in the making.
  • What you’ll find
    • A success index for the most popular sports competition of all: the FIFA World Cup
    • Facts and visuals around the index
    • R code used to extract information from the web and make the visuals (click the ‘code’ button at the right side of the visuals to check it out)
    • Tables containing datasets for the visuals, sources, and credits

The WCS_I

How to measure success in football?

To evaluate or measure something, one has to understand it first. So the first step in building the index is to answer the following question: what is success in football?
As in any other sport, the simplest definition of success is winning. But because of how football is played and ruled, success can lose that simplicity.
Depending on who is asked, success could mean winning at all costs or winning only in a certain style. To avoid ambiguity, the index was built on the simplest definition:
success = win

WCS_I explanation

  • WCS_I stands for ‘World Cup Success Index’. It covers the last 7 World Cups (from 1998 in France to 2022 in Qatar). The reason is that the competition had different formats before ’98, so WCS_I wouldn’t be valid for earlier editions unless modified.
  • Winning the World Cup is a total success, or 100% success, the equivalent of 1 point.
    Teams start with a minimum score of 0.20 due to a ‘qualifying effort’.
    After the group stage (the 1st stage of the competition), teams increase their score evenly (+0.16) each time they move on in the competition.
  • The score increases evenly to avoid arbitrariness, such as punishing “bad wins” or rewarding “good losses”.
  • The third-place match is ignored because the teams involved can no longer win the World Cup.
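The scoring rule described above can be sketched in a few lines of R. This is only an illustration, not the code behind the visuals: the stage labels and variable names are assumptions, and the only values taken from the text are the 0.20 starting score, the +0.16 step, and the 1-point score for the champion.

```r
# Illustrative sketch of the WCS_I ladder (stage labels are assumed, not from the source).
stages <- c("Group stage", "Last 16", "Quarter-finals",
            "Semi-finals", "Final", "Champion")

base_score <- 0.20  # 'qualifying effort' every participant starts with
step       <- 0.16  # even increase for each stage cleared

# Score per stage: the base score plus one step for every stage cleared.
wcs_i <- setNames(base_score + step * (seq_along(stages) - 1), stages)

print(round(wcs_i, 2))
```

Five even steps take a team from 0.20 to 1.00, which is why the increment is 0.16 and why no stage is weighted more heavily than another.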
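# Packages this chunk relies on: ggplot2, cowplot (draw_image), grid (rasterGrob), magrittr (%>%)
library(ggplot2)
library(cowplot)
library(grid)
library(magrittr)

# Bars show the WCS_I score for each stage reached
ggplot(ip_df, aes(x = achieved_stage_ip, y = s_points_ip)) +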
  geom_col(fill = "#3399CC", alpha = 0.6) +
  scale_y_continuous(limits = c(-0.04,1.3), breaks = s_points_ip) +
  labs(title = "How WCS_I works", 
       subtitle = "WCS_I applied to current world champions: Argentina (a.k.a 'La Scaloneta')",
       x = '',
       y = "WCS_I") + 
  theme(plot.title = element_text(size = 18, face = 'bold', hjust = 0), 
        plot.subtitle = element_text(size = 14, face = 'bold', hjust = 0),
        axis.text = element_text(size = 14), axis.ticks.y = element_blank(), 
        panel.grid.major.y = element_line(color = 'gray', linewidth = .25, 
                                          linetype = "dashed"), 
        panel.background = element_blank()) +
  geom_segment(data = data.frame(x = 1, xend = 5.8, y = -.04, yend = -.04), 
               aes(x = x, xend = xend, y = y, yend = yend), 
           color = "#3399CC", alpha = 0.85,
           linewidth = 4, linejoin = c('mitre'),
           arrow = arrow(angle = 10, length = unit(0.4,'inches'))) +
  draw_image(img_wc_1, x = 5.5, y = 1, height = .25, scale = 1) +
  geom_label(data = arg_expl_summ_df[1,], 
             aes(x = 1, 
                 y = y_coord -0.18,  label = summ), 
             fill = "#3399CC", alpha = 0.1, size = 3) + 
  geom_label(data = arg_expl_summ_df[-1,], 
             aes(x = c(1,2,3,4,5), 
                 y = y_coord -0.18,  label = summ), 
             fill = "#3399CC", alpha = 0.6, size = 3) +
  annotation_custom(img_saint_1 %>% rasterGrob(), 
                    xmin=5.2, xmax=6.2, ymin=1.35, ymax=1.55) +
  coord_cartesian(clip = "off")

WCS_I results: Map and Table

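library(plotly)  # plot_geo(), add_trace(), add_annotations(), colorbar(), layout(), raster2uri()

# Base map settings: projection, country borders, and land color
p <- list(projection = list(type = 'natural earth'), 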
          showcountries = TRUE, countrycolor = "black", 
          showland = TRUE, landcolor = "white")


plot_geo(wc_map_wcs_i, locationmode = 'code', locations = ~ISO_codes) %>% 
  add_trace(z = ~sum_s_points, text = text_map, hoverinfo = "text", 
            color = ~sum_s_points, colors = 'Blues', 
            marker = list(line = list(color = "green", width = .5))) %>% 
  add_annotations(text = paste("~ 5 countries were excluded from the map: England, Scotland, Serbia & Montenegro, Yugoslavia, and Wales.",
                               "* The British nations could not be set apart, and thus were aggregated as United Kingdom in the map.",
                               "* Serbia & Montenegro, and Yugoslavia don't exist anymore.",
                               sep = "\n"),
                  showarrow = FALSE, x = .5, y = -0.1) %>%
  colorbar(title = "Index sum", x = 1, y = .8) %>% 
  layout(title = paste("<b>WCS_I around the globe<br>(1998 - 2022)</b>"), geo = p,
         images = list(list(source = raster2uri(pen), 
                             xref = "paper", yref = "paper", x= .85, y= 1.06, 
                             sizex = .17, sizey = .14)))
  • Among the top 10, there are 3 American and 7 European countries.
    • Believe it or not, Mexico makes it to the top 10 while others with a greater football culture, like Belgium, Italy, or Uruguay, don’t. The best part is that Mexico never got further than the ‘Last 16’ in these World Cups, but it was consistent: it reached that stage in every competition from 1998 to 2018, and only went home early in 2022.
      Mexico’s presence in every World Cup is helped by the lack of difficulty it faces during the qualifying rounds (maybe the weakest qualifiers of all, quality-wise).
    • England makes it to the top 5. It came as a surprise for a couple of reasons:
      1. WCS_I ranks them above champions Italy and Spain.
      2. Their playing style has been increasingly passive, prioritizing not conceding goals over scoring them.
        Again, consistency does the trick for them: they’ve always been present and reached relatively high stages on most occasions, while the previously mentioned champions didn’t.
    • While consistency is there for both England and Mexico, the latter never got as far in the competition as the former. One could say that’s due to a quality gap between the two nations.
  • In terms of football culture and the understanding of what success means in the game, the table is diverse enough to show that there is no single approach to winning.
    That said, the attacking/positive approach to the game is predominant… Argentina, Brazil, Germany, and the Netherlands are examples of it.
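To see why consistency pays off in the summed index, here is a rough worked example for Mexico and Italy. It is a sketch under assumptions, not the author's pipeline: the stage labels and the stage-to-score mapping follow the 0.20 / +0.16 rule, a team that did not qualify is assumed to score 0, and the stage reached in each tournament was typed in by hand rather than read from the datasets, so it is worth checking against the tables in the appendix. The name `sum_s_points` simply mirrors the column used in the map code.

```r
# Assumed stage-to-score mapping (0.20 start, +0.16 per stage; DNQ = did not qualify).
score <- c("DNQ" = 0, "Group stage" = 0.20, "Last 16" = 0.36,
           "Quarter-finals" = 0.52, "Semi-finals" = 0.68,
           "Final" = 0.84, "Champion" = 1.00)

years <- seq(1998, 2022, 4)  # the 7 tournaments covered by WCS_I

# Hand-typed stage reached per tournament, in the same order as `years`.
mexico <- c(rep("Last 16", 6), "Group stage")
italy  <- c("Quarter-finals", "Last 16", "Champion",
            "Group stage", "Group stage", "DNQ", "DNQ")

# Summed index per country across all tournaments.
sum_s_points <- c(Mexico = sum(score[mexico]), Italy = sum(score[italy]))

print(sum_s_points)
```

Under these assumptions Mexico ends slightly ahead of Italy, even though Italy lifted the trophy in 2006: six steady ‘Last 16’ finishes outweigh one title followed by early exits and missed tournaments.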

Appendix

Success…?

WCS_I is a simple way to understand how national teams performed in the World Cup over the last couple of decades or so. What makes it simple is its only premise: to be successful, one should win.

Not everybody feels comfortable with such a premise because it lacks a ‘how to win’ component. In other words, it doesn’t consider the style of play or the tactics behind the result. Of course, the index creator empathizes with the sentiment, but including this component raises the following issues…

1. How could anybody weigh one style of play over another when, at the end of the day, every known style of play has delivered results, even in modern football?
2. There is no data available to capture the style of play of every team in the sample. And even if there were, figuring out the preferred tactics of every manager, and whether they can make them work on the pitch, would be a lot of work.

The visual above shows plausible understandings of success and how the component mentioned before (how to win) gets involved in the question ‘What is success in football?’.

2nd Map

In case the previous map isn’t realistic enough, here’s another one with a higher resolution.
Note that the higher resolution can slow down manipulation and navigation on the map.

Data sets

  • Map dataset
wc_map_wcs_i %>% 
  select(-c(n.x, n.y)) %>% 
  datatable(rownames = FALSE, 
            extensions = 'Buttons',
            options = list(dom = 'Blfrtip',
                           buttons = c('copy', 'csv', 'excel'), 
                           pageLength = 4, lengthMenu = c(1, 2, 4)))
  • Summary dataset
wc_map %>% 
  select(-c(n.x, n.y)) %>% 
  datatable(rownames = FALSE, 
            extensions = 'Buttons',
            options = list(dom = 'Blfrtip',
                           buttons = c('copy', 'csv', 'excel'), 
                           pageLength = 4, lengthMenu = c(4, 8, 12)))
  • Long dataset
wc_all_labeled_long %>% 
  select(-n) %>% 
  datatable(rownames = FALSE, 
            extensions = 'Buttons',
            options = list(dom = 'Blfrtip',
                           buttons = c('copy', 'csv', 'excel'), 
                           pageLength = 4, lengthMenu = c(4, 8, 12)))
  • Wide dataset
wc_all_labeled_wide_sum %>% 
  select(-n) %>% 
  datatable(rownames = FALSE, 
            extensions = 'Buttons',
            options = list(dom = 'Blfrtip',
                           buttons = c('copy', 'csv', 'excel'), 
                           pageLength = 4, lengthMenu = c(4, 8, 12)))

Imagery credits

reactable(img_cred, pagination = FALSE, highlight = TRUE, 
          height = 175,
          columns = list(
            imagery_tag = colDef(name = "Image label"),
            imagery_credit = colDef(name = "Credits (w/link)", 
                                    html = TRUE, 
                                    # Wrap each credit in a link to its source
                                    cell = function(value, index) {
                                      sprintf('<a href="%s" target="_blank">%s</a>', 
                                              img_cred$imagery_link[index], value)
                                    }), 
            imagery_link = colDef(show = FALSE) 
          )
)

Web scraping code (getting data directly from sites)

  • FBREF (click the ‘code’ button on the side)
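library(rvest)    # read_html(), html_nodes(), html_attr(), html_text()
library(stringr)  # str_detect()
library(dplyr)    # %>%, bind_cols(), anti_join(), select()

 # Unify: 1998 ~ 2018. ---------
  
  # 2022 will be included as soon as fbref updates its site...
  # wc_year <- seq(1998,2022,4) %>% as.character() %>% paste(., collapse = "|")

LINK <- "https://fbref.com/en/comps/"

wc_access <- read_html(LINK)

# Regex alternation: any of the target years, or the literal "1/World-Cup-Stats"
wc_year <- seq(1998,2018,4) %>% as.character(.) %>% 
  paste(., collapse = "|") %>% paste0(., "|1/World-Cup-Stats")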

wc_links <- wc_access %>%
  html_nodes("div table#comps_intl_fa_nonqualifier_senior tr.gender-m th a") %>% 
  html_attr("href") %>% paste0("https://fbref.com", . ) %>%
  .[str_detect(. , "World-Cup", negate = F)] %>%  
  read_html() %>% html_nodes("div th a") %>% html_attr("href") %>%
  paste0("https://fbref.com", . ) %>% .[str_detect(. , wc_year, negate = F)]


  # Got 3 groups... R16, Champions (a.k.a Winners), and group stage

  # r16

wc_links_r16 <- lapply(wc_links, function(i){
  read_html(i) %>% html_nodes("div.matchup-team a") %>% html_text() %>% .[17:32] %>% 
  as.data.frame()
  }) 

  # winners
wc_links_winners <- lapply(wc_links, function(i){
  read_html(i) %>% html_nodes("div.match-summary div.matchup-team") %>% 
  .[str_detect(. , "winner", negate = F)] %>% html_nodes("a") %>% html_text() %>%
    .[-2] %>% as.data.frame()
  }) 
  
  # gs
wc_links_gs <- lapply(wc_links, function(i){
  read_html(i) %>% html_nodes("div div.section_wrapper table tbody tr td a") %>%  
    html_text() %>% as.data.frame()
  })  #YEP!
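As a small aside on the link filtering used above, the following sketch shows how the `wc_year` pattern keeps only the wanted pages. It makes two assumptions: the URLs are invented for illustration and only imitate the shape of fbref links, and base R `grepl()` is used in place of `stringr::str_detect()` so the snippet runs without extra packages.

```r
# Hypothetical links, made up to imitate the shape of the scraped hrefs.
hrefs <- c("https://fbref.com/en/comps/1/1994/1994-World-Cup-Stats",
           "https://fbref.com/en/comps/1/1998/1998-World-Cup-Stats",
           "https://fbref.com/en/comps/1/2018/2018-World-Cup-Stats",
           "https://fbref.com/en/comps/1/World-Cup-Stats")

# Same pattern as in the scraper: target years joined with "|", plus one literal path.
wc_year <- paste0(paste(seq(1998, 2018, 4), collapse = "|"),
                  "|1/World-Cup-Stats")

# Keep a link if it matches any alternative in the pattern.
kept <- hrefs[grepl(wc_year, hrefs)]

print(kept)
```

With this toy input the 1994 link is dropped because it matches neither a target year nor the literal path, while the other three are kept.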
  # 2nd web scraping for ISO codes (needed to map countries) -------
 
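LINK.2 <- "https://countrycode.org/"

iso.iso <- read_html(LINK.2)

# Country names: text of every link inside the table rows
iso.1 <- iso.iso %>% 
  html_nodes("div table tbody tr a") %>% html_text() 

# ISO codes: every 6th cell starting from the 3rd, then characters 6 to 8 of each cell
# (presumably the 3-letter code; 2160 is the hard-coded cell count at scraping time)
iso.2 <- iso.iso %>% 
  html_nodes("div table tbody tr td") %>% .[seq(3,2160,6)] %>% 
  html_text() %>% substring(., 6, 8)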

# bind_cols(iso.1, iso.2)

# iso.1 %>% length()/2
# iso.2 %>% length()

# as.data.frame(iso.2) %>% count(iso.2) # Found duplicates

# iso.1 %>% unique() %>% length()
# iso.2 %>% unique() %>% length() 

# detect and solve duplication issue. 

iso.1 <- iso.1 %>% unique() 
iso.2 <- iso.2 %>% unique() # done...

ISO_tab <- bind_cols("countries" = iso.1, "ISO_codes" = iso.2)

IOS_rebels <- ISO_tab %>% 
  anti_join(., st_transform(ne_countries(scale = 'medium', type = 'countries', 
                                         returnclass = 'sf')) %>%
           select("name", "continent"), by = c("countries" = "name" )) %>% 
  .[1] %>% unlist() %>% as.character()
 
reb_corrected_ISO_list <- c("Antigua and Barb.", "Bosnia and Herz.", "Invalid", 
                              "British Virgin Is.", "Cayman Is.", 
                              "Central African Rep.", "Invalid", "Invalid", "Cook Is.",
                              "Curaçao", "Czech Rep.", "Dem. Rep. Congo", "Dominican Rep.",
                              "Timor-Leste", "Eq. Guinea", "Falkland Is.", "Faeroe Is.", 
                              "Fr. Polynesia", "Invalid", "Côte d'Ivoire", "Lao PDR", "Macao",
                              "Marshall Is.", "Invalid", "Invalid", "Dem. Rep. Korea", 
                              "N. Mariana Is.", "Invalid", "Congo", "Invalid", "St-Barthélemy",
                              "St. Kitts and Nevis", "St-Martin", "St. Pierre and Miquelon", 
                              "St. Vin. and Gren.", "São Tomé and Principe", "Solomon Is.", 
                              "Korea", "S. Sudan", "Invalid", "Invalid", "Turks and Caicos Is.", 
                              "Invalid", "U.S. Virgin Is.", "Wallis and Futuna Is.", "W. Sahara")


# length(IOS_rebels) == length(reb_corrected_ISO_list) logical check

ISO_tab[1] <- sapply(ISO_tab[1], function(o)
  replace(o, o %in% IOS_rebels, reb_corrected_ISO_list))
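The replacement above works only while `IOS_rebels` and `reb_corrected_ISO_list` have the same length and order. A possible alternative, sketched here on toy data, keys each correction by the original name so the order no longer matters. The data frame, the country names, and the `name_fix` vector are all made up for illustration and are not the author's objects.

```r
# Toy stand-in for ISO_tab (names and codes are illustrative only).
iso_tab <- data.frame(
  countries = c("Argentina", "Czech Republic", "South Korea"),
  ISO_codes = c("ARG", "CZE", "KOR"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: scraped name -> name expected by the map data.
name_fix <- c("Czech Republic" = "Czech Rep.",
              "South Korea"    = "Korea")

# Position of each country in the lookup (NA when no correction is needed).
hit <- match(iso_tab$countries, names(name_fix))

# Swap in the corrected name where one exists, keep the original otherwise.
iso_tab$countries <- ifelse(is.na(hit), iso_tab$countries, unname(name_fix[hit]))

print(iso_tab)
```

Because each correction travels with its own key, adding or removing an entry cannot silently shift the names that follow it.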